TABLE 5.6
Quantization results of BEBERT on the GLUE benchmark. The average results over all tasks are reported.

Method         #Bits        Size (MB)   GLUE (%)
BERT-base      full-prec.   418         82.84
DynaBERT       full-prec.   33          77.36
DistilBERT6L   full-prec.   264         78.56
BinaryBERT     1-1-4        16.5        78.76
BEBERT         1-1-4        33          80.96

TinyBERT6L     full-prec.   264         81.91
TernaryBERT    2-2-8        28          81.91
BinaryBERT     1-1-4        16.5        81.57
BEBERT         1-1-4        33          82.53

Inspired by the empirical observation in [3] that convolutional neural networks gain little accuracy from ensemble learning applied after KD, the authors removed KD during the ensemble stage to accelerate the training of BEBERT. Although the two-stage KD performs better in [106], it is time-consuming because forward and backward propagation must be conducted twice. Ensembling with prediction KD avoids this double propagation, and ensembling without KD additionally removes the evaluation of the teacher model. The authors then conducted experiments showing that applying KD during the ensemble of BinaryBERT has only a minor effect on its accuracy on the GLUE datasets, so BEBERT without KD saves training time while preserving accuracy. They further compared BEBERT with various SOTA compressed BERTs. The results listed in Table 5.6 show that BEBERT outperforms BinaryBERT in accuracy by up to 6.7%. Compared to the full-precision BERT, it also saves 15× and 13× on FLOPs and model size, respectively, with a negligible accuracy loss of 0.3%, showing its potential for practical deployment.
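As a rough illustration of this training recipe, the sketch below fine-tunes several binarized members with plain cross-entropy only (no teacher and no KD loss) and averages their predictions at inference. It is a minimal sketch under stated assumptions: `make_binary_bert` and the `(input_ids, attention_mask, labels)` batch format are hypothetical placeholders, and BEBERT's actual ensemble algorithm (e.g., its boosting details) is not reproduced here.

```python
import torch
import torch.nn.functional as F

def train_member(model, train_loader, epochs=3, lr=2e-5):
    """Fine-tune one ensemble member with the task loss only (KD is skipped)."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in train_loader:
            logits = model(input_ids, attention_mask)     # hypothetical model signature
            loss = F.cross_entropy(logits, labels)        # no distillation term
            optim.zero_grad()
            loss.backward()
            optim.step()
    return model

def ensemble_predict(models, input_ids, attention_mask):
    """Average the members' predicted probabilities (prediction-level ensemble)."""
    with torch.no_grad():
        probs = [F.softmax(m(input_ids, attention_mask), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

# Usage sketch: train members from different random seeds, then ensemble them.
# members = [train_member(make_binary_bert(seed=s), train_loader) for s in range(3)]
# preds = ensemble_predict(members, input_ids, attention_mask).argmax(dim=-1)
```

Skipping the KD term removes the teacher's forward pass from every training step, which is where the reported training-time savings come from.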

In summary, this paper’s contributions can be summarized as follows: (1) it is the first work that introduces ensemble learning to binary BERT models to improve accuracy and robustness; (2) it removes the KD procedures during ensemble to accelerate the training process.

5.9 BiBERT: Accurate Fully Binarized BERT

Though BinaryBERT [6] and BEBERT [222] binarize the weights and word embeddings, they do not accurately binarize BERT with 1-bit activations. To mitigate this, Qin et al. [195] proposed BiBERT, a fully binarized BERT model. BiBERT includes an efficient Bi-Attention structure that statistically maximizes representation information, and a Direction-Matching Distillation (DMD) scheme to optimize the fully binarized BERT accurately.

5.9.1 Bi-Attention

To address the information degradation of binarized representations in the forward propagation, the authors proposed an efficient Bi-Attention structure based on information theory, which statistically maximizes the entropy of the representations and revives the attention mechanism in the fully binarized BERT. Since the representations (weights, activations, and embeddings) with extremely compressed bit-widths in the fully binarized BERT have lim-